[DO NOT MERGE] Upstream codebase diff #470

Draft: wants to merge 1,008 commits into main

Conversation

@kzawora-intel commented Nov 6, 2024

Scope of changes:

  • Contiguous PA
  • Multi-step scheduling
  • Automatic prefix caching
  • Padding-aware scheduling/max_num_prefill_seqs
  • Guided decoding fixes
  • FP8 support (INC/w8a8/weights_load_device)
  • ApplyToppTopkScalar sampler optimization
  • LoRA/MultiLoRA support
  • FusedMoE support
  • Model changes (adding mark_steps)
  • Tests
  • FakeHPU mode
  • CI stuff (.jenkins, .github)
  • Lots of minor stuff (RNG, FSDPA flag, reduced block fragmentation)

@@ -0,0 +1,35 @@
name: cpu-test

Check failure

Code scanning / Scorecard

Token-Permissions High

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick other options.
NOTE: to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
@kzawora-intel marked this pull request as draft on November 6, 2024, 13:49
@kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label on Nov 8, 2024
@@ -0,0 +1,45 @@
name: codespell

Check failure

Code scanning / Scorecard

Token-Permissions High

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick other options.
NOTE: to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
def test_stateless_process_group(worker):
port1 = get_open_port()
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("", port1))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces Medium test

'' binds a socket to all interfaces.

Copilot Autofix AI about 1 month ago

To fix the problem, we need to bind the socket to a specific interface instead of all interfaces. In this case, we can bind it to the loopback interface 127.0.0.1, which is commonly used for local testing and development. This change will limit the socket to accept connections only from the local machine, reducing the security risks.

Suggested changeset 1
tests/distributed/test_utils.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/distributed/test_utils.py b/tests/distributed/test_utils.py
--- a/tests/distributed/test_utils.py
+++ b/tests/distributed/test_utils.py
@@ -124,3 +124,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", port1))
+        s.bind(("127.0.0.1", port1))
         port2 = get_open_port()
EOF
sock = socket.socket(family=family, type=socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(addr)

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces Medium

'' binds a socket to all interfaces.

Copilot Autofix AI about 1 month ago

To fix the problem, we need to ensure that the socket binds to a specific interface rather than all interfaces. This can be achieved by modifying the create_server_socket function to check if the provided address is empty or 0.0.0.0 and raise an error or use a default specific interface instead.

  1. Modify the create_server_socket function to validate the address.
  2. If the address is empty or 0.0.0.0, raise an error or use a default specific interface.
  3. Update the run_server function to handle the potential error raised by create_server_socket.
Suggested changeset 1
vllm/entrypoints/openai/api_server.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -612,2 +612,5 @@
 def create_server_socket(addr: Tuple[str, int]) -> socket.socket:
+    if addr[0] in ("", "0.0.0.0"):
+        raise ValueError("Binding to all interfaces is not allowed. Please specify a valid IP address.")
+
     family = socket.AF_INET
@@ -640,3 +643,7 @@
     sock_addr = (args.host or "", args.port)
-    sock = create_server_socket(sock_addr)
+    try:
+        sock = create_server_socket(sock_addr)
+    except ValueError as e:
+        logger.error(e)
+        return
 
EOF
# Llama3.2 models more reliable.

TOOL_CALL_REGEX = re.compile(
r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]",

Check failure (this line was flagged 9 times)

Code scanning / CodeQL

Inefficient regular expression High

Parts of this regular expression may cause exponential backtracking: on strings starting with '[' and containing many repetitions of 'AA(),' or 'AA()'; on strings starting with '[A(' and containing many repetitions of 'AA=,', 'AA= ),A(', or 'AA=)A('; and on strings starting with '[A(A=' and containing many repetitions of ',A=' or ')A(A='.
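For context, a minimal sketch (not part of the PR) of why CodeQL flags this pattern: on a crafted non-matching input, the ambiguity between `[a-zA-Z]+` and `\w*` inside the repeated groups, and between the two outer quantified groups, forces the engine to retry a combinatorial number of splits before failing. The repetition counts below are illustrative; match time grows sharply as n increases, so raise it cautiously.

```python
import re
import time

TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*"
    r"([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]")

for n in (12, 14, 16, 18):
    attack = "[" + "AA()," * n  # no closing ']', so the match must fail
    start = time.perf_counter()
    TOOL_CALL_REGEX.match(attack)
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```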
mfylcek and others added 24 commits November 25, 2024 09:36
Limit decode bucket size to num_hpu_blocks
Signed-off-by: Sanket Kale <[email protected]>
Co-authored-by: Sanket Kale <[email protected]>
Co-authored-by: mgoin <[email protected]>
Fixes issue with multi LoRA during `profile_run`.
We are seeing a 10% performance regression in llama-based models due to
vllm-project#10239. The mark_step()
function needs to be configured differently for each model to achieve
the best performance. For some models, calling mark_step() for every decoder
step would be optimal, but for other models it's better to run it every
n-th step. We are adding a counter to only register the hook for every
n-th step, which can be configured with VLLM_CONFIG_HIDDEN_LAYERS.
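A hedged sketch of how such a counter could look (not the PR's exact code; whether the interval counts decoder layers or decode steps is an assumption based on the VLLM_CONFIG_HIDDEN_LAYERS name):

```python
import os
import habana_frameworks.torch as htorch

# Read the interval once; default to breaking the graph at every step.
MARK_STEP_INTERVAL = int(os.getenv("VLLM_CONFIG_HIDDEN_LAYERS", "1"))

def maybe_mark_step(step_idx: int) -> None:
    # mark_step() flushes the accumulated HPU graph; calling it too often adds
    # launch overhead, while calling it too rarely builds overly large graphs.
    if (step_idx + 1) % MARK_STEP_INTERVAL == 0:
        htorch.core.mark_step()
```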
kzawora-intel and others added 5 commits December 11, 2024 13:11
i think inception was a decent movie overall
Signed-off-by: Konrad Zawora <[email protected]>
With this patch, the mp executor no longer hangs at the end of the application
and exits gracefully out of the box.
New useful checks were added upstream, but we're not running them on habana_main
per PR. This PR fixes that.
@@ -0,0 +1,32 @@
name: Lint documentation

Check failure

Code scanning / Scorecard

Token-Permissions High

score is 0: no topLevel permission defined
Remediation tip: visit https://app.stepsecurity.io/secureworkflow, tick 'Restrict permissions for GITHUB_TOKEN', and untick other options.
NOTE: to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
kdamaszk and others added 24 commits December 12, 2024 09:41
Without this change we can observe the error below:
```
[rank0]:   File "/software/users/kdamaszke/repos/vllm-fork/vllm/model_executor/models/mllama.py", line 959, in forward
[rank0]:     full_text_row_masked_out_mask = full_text_row_masked_out_mask.view(
[rank0]: RuntimeError: shape '[4, -1, 1]' is invalid for input of size 3
```
It occurs when one of the requests is removed from the batch earlier. In
that case, the language model is still working on shapes padded to the
bucketed batch size, while the encoder input isn't. This change aligns
the batch size of `encoder_seq_lens` to the expected one.
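A minimal sketch of the alignment idea (illustrative only, not the PR's exact code): pad `encoder_seq_lens` up to the bucketed batch size so both sides agree on the batch dimension.

```python
def pad_encoder_seq_lens(encoder_seq_lens: list[int],
                         bucketed_batch_size: int) -> list[int]:
    """Pad with zero-length entries for requests removed from the batch."""
    missing = bucketed_batch_size - len(encoder_seq_lens)
    return encoder_seq_lens + [0] * max(missing, 0)
```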
Now that the one_hot operator has an implementation for eager and compile mode,
the workaround is no longer needed.
Fix for batch size padding in multi-step scheduling by
@SanjuCSudhakaran.

Co-authored-by: Sanju C Sudhakaran <[email protected]>
During warmup, inference mode is used, but at runtime it's
overwritten by no_grad mode; this causes recompilations due to a
dispatch key mismatch in torch.compile.
This switches the no_grad mode to inference_mode from the base class.

---------

Co-authored-by: Rafal Litka <[email protected]>
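A hedged illustration of the fix (not the PR's code): keep warmup and runtime under the same autograd context so torch.compile sees consistent dispatch keys.

```python
import torch

@torch.inference_mode()
def warmup_step(model, dummy_inputs):
    return model(*dummy_inputs)

@torch.inference_mode()  # previously torch.no_grad() here triggered recompiles
def execute_step(model, inputs):
    return model(*inputs)
```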
Generic name discovery for rope.prepare_cos_sin. It fixes errors in
models that don't follow a specific naming hierarchy.
Add a new member to the list of codeowners.
#566 breaks the long-context +
LoRA flow.

It assumes that caching the sin-cos buffer for the first decoder layer is
sufficient to handle all cases, which is not applicable for
long-context + LoRA.

This PR ignores the `_prepare_cos_sin` call prior to HpuModelAdapter forward
in the long-context + LoRA flow.
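A rough sketch of the gating idea (the helper name, flag, and call signature are illustrative assumptions, not the PR's actual interface):

```python
def maybe_prepare_cos_sin(rope, positions, long_context_lora: bool) -> None:
    # In the long-context + LoRA flow, skip the early call and let
    # HpuModelAdapter's forward pass handle sin-cos preparation instead.
    if long_context_lora:
        return
    rope.prepare_cos_sin(positions)  # hypothetical call signature
```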
This PR solves the "ModuleNotFoundError: No module named torch.hpu" in
test_lora_manager_hpu.py::test_from_lora_tensors by importing
"habana_frameworks.hpu" into the LoRA model.

Co-authored-by: Vivek Goel <[email protected]>
This PR updates `test_layers_hpu.py` and `test_lora_hpu.py` to align
with `PunicaWrapper` refactor.

Related PR: #614
Error reported in https://jira.habana-labs.com/browse/SW-212516

Two recently merged PRs were found to break Spec Decode functionality:

1. #491 overrides the existing
WorkerWrapperBase design for speculative decoding.
```
if model_runner_cls is not None:
    ModelRunnerClass = model_runner_cls
```
is no longer needed, since we now use the code below to init model_runner_cls,
following the upstream design.
```
if model_runner_cls is not None:
    self.model_runner = model_runner_cls(self.model_runner)
```

2. #566 does not work in Spec
Decode Eagle mode, because the input tensors now differ from the previous
assumption that decode_fwd provides only one token per sequence; Spec Decode
provides multiple candidate tokens as q.
To fix that, a new ENV, "**VLLM_COS_SIN_RECOMPUTE**=true", was added; it needs
to be set to trigger recomputation of cos and sin for spec decode.

---------

Signed-off-by: Chendi.Xue <[email protected]>
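A hedged sketch of how the new flag could gate the recomputation (the helper and its signature are illustrative; only the VLLM_COS_SIN_RECOMPUTE name comes from the description above):

```python
import os

RECOMPUTE_COS_SIN = os.getenv("VLLM_COS_SIN_RECOMPUTE", "false").lower() in ("1", "true")

def get_cos_sin(rope, positions, cached=None):
    # With speculative decoding, q carries several candidate tokens per
    # sequence, so a cache shaped for single-token decode would be stale.
    if cached is None or RECOMPUTE_COS_SIN:
        cached = rope.prepare_cos_sin(positions)  # hypothetical signature
    return cached
```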
This PR fixes the slow sampling on HPU when repetition_penalty is
set in the sampling parameters.

It replaces the slow PyTorch API on HPU and mitigates the dynamic shapes
in the code.

Without this PR:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0,
repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1,
min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[],
include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024,
min_tokens=0, logprobs=None, prompt_logprobs=None,
skip_special_tokens=True, spaces_between_special_tokens=True,
truncate_prompt_tokens=None, guided_decoding=None)
Warming up...
Profiling iterations: 100%|5/5 [03:32<00:00, 42.49s/it]
Avg latency: 42.49439047839987 seconds
10% percentile latency: 11.322476224999628 seconds
25% percentile latency: 11.32563829100036 seconds
50% percentile latency: 11.331052645000455 seconds
75% percentile latency: 11.333669468998778 seconds
90% percentile latency: 104.8302020711999 seconds
99% percentile latency: 160.92812163252054 seconds

With PR:
Avg latency: 11.038154767800005 seconds
10% percentile latency: 10.964674918200398 seconds
25% percentile latency: 10.964709408001 seconds
50% percentile latency: 10.966433088000485 seconds
75% percentile latency: 10.967024742998547 seconds
90% percentile latency: 11.18358270219942 seconds
99% percentile latency: 11.313517477719943 seconds

Testing code:

https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh

The only difference between this PR and
#442 is that I do not enable
pin_memory, as this feature's readiness is poor on HPU.
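For illustration, one common way to avoid dynamic shapes in this kind of penalty kernel is to work on a fixed [batch, vocab] mask instead of data-dependent index lists. This is a hedged sketch of that idea, not necessarily the PR's exact change (padding handling is simplified):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             seen_token_ids: torch.Tensor,
                             penalty: float) -> torch.Tensor:
    # logits: [batch, vocab]; seen_token_ids: [batch, max_seen] (padded).
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask.scatter_(1, seen_token_ids, True)
    # Standard repetition penalty: shrink positive logits, grow negative ones.
    penalized = torch.where(logits > 0, logits / penalty, logits * penalty)
    return torch.where(mask, penalized, logits)
```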
Original: #599

We have a case where topk=1 and topp<1.

Adding special handling for the case topk=1 and handle_duplicate=0 (by
default handle_duplicate=0, to support num-scheduling-steps).
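A minimal sketch of why topk=1 deserves a fast path (illustrative, not the sampler's actual code): with top_k == 1 the filtered distribution collapses to the argmax token, so the general sort-and-cumsum top-k/top-p path can be skipped.

```python
import torch

def sample_topk1(logits: torch.Tensor) -> torch.Tensor:
    # logits: [batch, vocab] -> next-token ids: [batch]
    return torch.argmax(logits, dim=-1)
```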
This PR fixes a bug that results in the following RuntimeError when APC
is enabled.
```
ERROR 12-19 02:30:05 engine.py:140]   File "/workspace/vllm/worker/hpu_model_runner.py", line 854, in _prepare_prompt
ERROR 12-19 02:30:05 engine.py:140]     if prefix_block_list_tensor:
ERROR 12-19 02:30:05 engine.py:140] RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
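The failing check and the usual fix, as a hedged illustration (truth-testing a multi-element tensor is what raises the RuntimeError above):

```python
import torch

prefix_block_list_tensor = torch.tensor([1, 2, 3])

# if prefix_block_list_tensor:  # RuntimeError: Boolean value of Tensor with
#     ...                       # more than one value is ambiguous

if prefix_block_list_tensor is not None and prefix_block_list_tensor.numel() > 0:
    pass  # explicit emptiness check instead of implicit bool()
```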
Add LLaVA support with a prompt for benchmarking throughput using images.
This is an updated version of
#650.


Coupled with [Use FusedSDPA for MllamaVisionSdpaAttention
#620], it resolves two issues
that arise when running the Llama 3.2 vision model:

GC fail when batch size > 1 on Gaudi3.
Increased device memory consumption with Torch 2.5 compared to Torch
2.4.

---------

Signed-off-by: yan ma <[email protected]>
Co-authored-by: yisonzhu <[email protected]>
Use `FusedSDPA` instead of the regular `F.scaled_dot_product_attention` in the
`MllamaVisionSdpaAttention` module.

The difference between these two ops is precision:
`F.scaled_dot_product_attention` converts the input to float32 and
performs all operations in this data type, while `FusedSDPA` does not.
However, it changes accuracy only from 0.449 to 0.446 on an accuracy test
based on the MMMU dataset and lm-evaluation-harness, while improving
single-prompt performance from ~550ms to ~100ms.
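A hedged sketch of the swap (the FusedSDPA import path and argument order follow Habana's kernel API as I understand it; the exact signature may vary between SynapseAI releases):

```python
import torch.nn.functional as F
from habana_frameworks.torch.hpex.kernels import FusedSDPA

def vision_sdpa(q, k, v, attn_mask=None, use_fused=True):
    if use_fused:
        # Keeps the computation in the input dtype (no fp32 upcast), trading a
        # small accuracy delta for much lower latency, per the numbers above.
        return FusedSDPA.apply(q, k, v, attn_mask, 0.0, False)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```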
Fix warmup for the encoder-decoder model runner by limiting the number of
dummy cross-attention blocks to the available blocks. Without this we will
encounter an error in CrossAttention due to the lack of available blocks.
Remove the workaround using torch.ops.hpu.fp8_gemm_v2 for HPU.
This PR adds changes required to enable MSS with LoRA flow. Checked
there are no regressions using vllm-fork CI job
https://tf-jenkins-ctrl01.habana-labs.com/job/vLLM/view/CI-jobs/job/vLLM-CI-Pipeline/429/
- Added actionlint.yaml to allow usage of self-hosted runners (without
it, actionlint will throw an error). I also tried to disable some of the
shellcheck warnings/errors but couldn't, so this PR should probably be
merged even though actionlint is failing.
- Updated the Trigger Jenkins workflow; it now contains 4 jobs:
1. Dependency Scan - fails the job if a dependency with a high-severity
vulnerability is part of the PR
2. CodeQL Scan - scans the Python code itself
3. Calculate Tests To Trigger - reads the .jenkins/test_config.yaml
file and triggers all the tests configured in it
4. Tests - the tests running on Gaudi resources